Causal Relationships in the Quantitative Social Sciences I

PSCI 3300.003 Political Science Research Methods

A. Jordan Nafa

University of North Texas

September 13th, 2022

Overview

  • Better understanding of correlation and causation

  • Ways of thinking about and formally expressing causal relationships

  • No class Thursday, I will be at a conference in Canada

  • Research Question Assignment is due Sunday, September 18th

  • Problem set two will be posted on Canvas by the end of the day today

\[ \definecolor{treat}{RGB}{27,208,213} \definecolor{outcome}{RGB}{98,252,107} \definecolor{baseconf}{RGB}{244,199,58} \definecolor{covariates}{RGB}{178,26,1} \definecolor{index}{RGB}{37,236,167} \definecolor{timeid}{RGB}{244,101,22} \definecolor{mu}{RGB}{71,119,239} \definecolor{sigma}{RGB}{219,58,7} \newcommand{\normalcolor}{\color{white}} \newcommand{\treat}[1]{\color{treat} #1 \normalcolor} \newcommand{\resp}[1]{\color{outcome} #1 \normalcolor} \newcommand{\sample}[1]{\color{baseconf} #1 \normalcolor} \newcommand{\covar}[1]{\color{covariates} #1 \normalcolor} \newcommand{\obs}[1]{\color{index} #1 \normalcolor} \newcommand{\tim}[1]{\color{timeid} #1 \normalcolor} \newcommand{\mean}[1]{\color{mu} #1 \normalcolor} \newcommand{\vari}[1]{\color{sigma} #1 \normalcolor} \]

What is Correlation?

Correlation is the degree to which two or more features of the world tend to occur in tandem.

  • We’ll call these features of the world “variables”

  • If two variables \(\treat{X}\) and \(\resp{Y}\) tend to occur together or increase at the same rate, we would say they are positively correlated

  • If the occurrence of \(\treat{X}\) is unrelated to \(\resp{Y}\), we would say these two variables are uncorrelated

  • If when \(\treat{X}\) occurs we are less likely to observe \(\resp{Y}\), we would say these two variables are negatively correlated
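As a quick sketch of all three cases (simulated data, not from the lecture's examples; the variable names here are just illustrative), we can generate variables with known relationships and check their correlations in R:

```r
# Simulate a variable x and three variables related to it in different ways
set.seed(123)
x <- rnorm(1000)

# y_pos tends to rise with x: positive correlation
y_pos <- x + rnorm(1000)

# y_neg tends to fall as x rises: negative correlation
y_neg <- -x + rnorm(1000)

# y_unrel is generated independently of x: correlation near zero
y_unrel <- rnorm(1000)

cor(x, y_pos)    # positive, roughly 0.7
cor(x, y_neg)    # negative, roughly -0.7
cor(x, y_unrel)  # close to 0
```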

The Resource Curse

Consider the case of the resource curse in comparative politics: an alleged negative correlation between dependence on oil production and democracy

                Not a Major Oil Producer   Major Oil Producer
Democracy                  78                      15
Non-Democracy              58                      26

  • Probability that a country is a democracy, conditional on whether it is a major oil producer

    • \(\Pr(\mathrm{Democracy} | \mathrm{No~Oil}) = \frac{78}{78+58} \approx 0.5735\)

    • \(\Pr(\mathrm{Democracy} | \mathrm{Oil}) = \frac{15}{15+26} \approx 0.3659\)
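These conditional probabilities can be computed directly from the 2x2 table in R (the object name `resource_tab` is just for illustration):

```r
# The 2x2 table from the slide: rows are regime type, columns oil status
resource_tab <- matrix(
  c(78, 15,
    58, 26),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Democracy", "Non-Democracy"),
    c("No Oil", "Oil")
  )
)

# Pr(Democracy | No Oil): democracies as a share of non-producers
resource_tab["Democracy", "No Oil"] / sum(resource_tab[, "No Oil"])
# [1] 0.5735294

# Pr(Democracy | Oil): democracies as a share of major oil producers
resource_tab["Democracy", "Oil"] / sum(resource_tab[, "Oil"])
# [1] 0.3658537
```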

Example: The Resource Curse

Consider the case of the resource curse in comparative politics: an alleged negative correlation between dependence on oil production and democracy

                         Not a Major Oil Producer   Major Oil Producer   Pr(Oil | Row)
Democracy                           78                      15              0.1613
Autocracy                           58                      26              0.3095
Pr(Democracy | Column)            0.5735                  0.3659

  • Oil-producing countries are less likely to be democracies than non-oil-producing countries. Why?

Democracy and Gender Equality

What is Correlation Good For?

  • Description

    • Suppose we want to know whether countries where gender equality is higher are more democratic on average

    • We might be interested in the correlation between gender equality and democracy

    • If our data are good, we can provide a yes or no answer to this question with few assumptions

What is Correlation Good For?

We can estimate this correlation in R with a simple linear model and data from the Varieties of Democracy project’s {vdemdata} package

# Load the tidyverse library
pacman::p_load(
  "tidyverse",
  "latex2exp"
  )

# Install the vdemdata package if you didn't read the R programming tutorial
remotes::install_github("vdeminstitute/vdemdata")

What is Correlation Good For?

We can estimate this correlation in R with a simple linear model and data from the Varieties of Democracy project’s {vdemdata} package

# Get the data we need from the vdemdata package
vdem_df <- vdemdata::vdem %>% 
  # We'll use just the year 2018 here for simplicity
  filter(year == 2018) %>%  
  # Transmute a subset of the data for plotting
  transmute(
    country_name, 
    v2x_polyarchy = v2x_polyarchy*10, 
    v2x_gender = v2x_gender*10
    )

# Estimate the linear relationship
lm_democ_gender <- lm(v2x_polyarchy ~ v2x_gender, data = vdem_df)

# Print a summary of the result
broom::tidy(lm_democ_gender)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    -3.03    0.486      -6.24 3.14e- 9
2 v2x_gender      1.12    0.0642     17.5  2.35e-40

Democracy and Gender Equality

[Plot: scatter of v2x_polyarchy against v2x_gender with the fitted regression line]

What is Correlation Good For?

  • Prediction and Forecasting

    • Suppose we want to predict which financial transactions are likely to be fraudulent

    • With large amounts of data on legitimate consumer transactions, we could develop a model that detects anomalies that are likely to be fraudulent

    • Major financial firms have entire teams dedicated to predicting and preventing fraud in this manner

    • Other examples of prediction include election forecasting, self-driving cars, and many other tasks

What is Correlation Good For?

  • Causal Inference

    • Suppose we want to know if high school students would be more successful in life if they were forced to take calculus

    • The observed correlation between calculus and future success might be useful

    • But we would have to assume that the students taking calculus are otherwise the same as everyone else in terms of their underlying chances of success

    • Aside from very special circumstances, this kind of assumption will be hard to defend

    • As a general rule, correlation does not imply causation

Descriptive Statistics

For every variable we observe, we can compute a number of different statistics. Three that are particularly useful for understanding data are mean, variance, and standard deviation

  • Mean \(\mean{\mu}_{\treat{x}}\): \(\frac{\sum_{\obs{i}=\obs{1}}^{\sample{N}} \treat{x}_{\obs{i}}}{\sample{N}}\)

  • Variance \(\vari{\sigma}_{\treat{x}}^{2}\): \(\frac{\sum_{\obs{i}=\obs{1}}^{\sample{N}} (\treat{x}_{\obs{i}} - \mean{\mu}_{\treat{x}})^{2}}{\sample{N}}\)

  • Standard Deviation \(\vari{\sigma}_{\treat{x}}\): \(\sqrt{\vari{\sigma}_{\treat{x}}^{2}}\)

  • The mean is known as the first moment of a variable’s distribution, and the variance as its second central moment
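One caveat worth flagging: the formulas above divide by N (the population versions), while R's built-in var() and sd() divide by n - 1 (the sample versions). A minimal sketch of the difference, using a made-up vector:

```r
# A small made-up vector for illustration
x <- c(2, 4, 6, 8)
n <- length(x)

# Mean: identical under either convention
mu_x <- sum(x) / n
mu_x
# [1] 5

# Population variance, as in the slide's formula (divide by N)
var_pop <- sum((x - mu_x)^2) / n
var_pop
# [1] 5

# R's var() divides by n - 1 instead, giving a slightly larger value
var_samp <- var(x)
var_samp
# [1] 6.666667

# The two versions are related by a factor of n / (n - 1)
all.equal(var_samp, var_pop * n / (n - 1))
# [1] TRUE
```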

Univariate Descriptive Statistics in R

Continuing with our previous example, we can calculate the mean, standard deviation, and variance of a numeric variable in R

# Mean of women's political empowerment in 2018
mean(vdem_df$v2x_gender, na.rm = TRUE)
[1] 7.355531
# Standard deviation of women's political empowerment in 2018
sd(vdem_df$v2x_gender, na.rm = TRUE)
[1] 1.81225
# Variance of women's political empowerment in 2018
var(vdem_df$v2x_gender, na.rm = TRUE)
[1] 3.284252

These are univariate statistics, meaning they describe the distribution of a single variable

Bivariate Descriptive Statistics

Now that we’ve learned some notation and can compute useful statistics for a single variable, we can start thinking about how to measure the correlation between two variables

  • One useful measure of correlation is the covariance

    • \(Cov_{\treat{x}, \resp{y}} = \frac{\sum_{\obs{i}=\obs{1}}^{\sample{N}} (\treat{x}_{\obs{i}} - \mean{\mu}_{\treat{x}})(\resp{y}_{\obs{i}} - \mean{\mu}_{\resp{y}})}{\sample{N}}\)
  • A second is the correlation coefficient

    • \(\rho_{\treat{x}, \resp{y}} = \frac{Cov_{\treat{x}, \resp{y}}}{\vari{\sigma}_{\treat{x}} \cdot \vari{\sigma}_{\resp{y}}}\)

Bivariate Descriptive Statistics in R

We can calculate the covariance and correlation coefficient for two numeric variables in R

# Covariance for gender and democracy
cov(x = vdem_df$v2x_gender, y = vdem_df$v2x_polyarchy)
[1] 3.682169
# Correlation coefficient for gender and democracy
cor(x = vdem_df$v2x_gender, y = vdem_df$v2x_polyarchy)
[1] 0.7954725

Bivariate Descriptive Statistics in R

We could calculate the correlation and covariance manually as well for illustrative purposes

# (x_{i} - mu_{x})
cov_x <- vdem_df$v2x_gender - mean(vdem_df$v2x_gender)

# (y_{i} - mu_{y})
cov_y <- vdem_df$v2x_polyarchy - mean(vdem_df$v2x_polyarchy)

# (x_{i} - mu_{x})(y_{i} - mu_{y})
cov_xy <- cov_x*cov_y

# Sum of (x_{i} - mu_{x})(y_{i} - mu_{y})
sum_cov_xy <- sum(cov_xy)

# Covariance of x and y
(cov_result <- sum_cov_xy/length(cov_xy))
[1] 3.661598
# sigma_{x} * sigma_{y}
sigma_xy <- sd(vdem_df$v2x_gender)*sd(vdem_df$v2x_polyarchy)

# Correlation
cov_result/sigma_xy
[1] 0.7910286

Note that these manual results differ slightly from cov() and cor() above: our formulas divide by N, while R’s built-in functions divide by n - 1 (the sample versions)

Measuring Correlation

  • The correlation coefficient tells us about the tightness of the relationship between two variables.

  • But we often care more about the substantive magnitude of the relationship. How much does \(\resp{Y}\) vary as \(\treat{X}\) varies?

  • To answer this question, we want to know the slope of the regression line.

  • \(\beta = \frac{Cov_{\treat{x}, \resp{y}}}{\vari{\sigma}_{\treat{x}}^{2}}\)

  • On average, for every one-unit increase in \(\treat{X}\), \(\resp{Y}\) increases by…
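We can verify this identity with simulated data (the variable names and the true slope of 2 here are illustrative, not from the V-Dem example). Because cov() and var() both divide by n - 1, the factors cancel and the formula reproduces the lm() slope exactly:

```r
# Simulate data with a known slope of 2
set.seed(42)
x <- rnorm(500)
y <- 2 * x + rnorm(500)

# Slope computed from the formula: Cov(x, y) / Var(x)
beta_manual <- cov(x, y) / var(x)

# Slope estimated by lm()
beta_lm <- coef(lm(y ~ x))["x"]

# The two agree up to floating-point error
all.equal(unname(beta_lm), beta_manual)
# [1] TRUE
```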

Measuring Correlation

When \(y = mx + b\) Fails

  • An unfortunate fact of life, though one that some social scientists all too often fail to appreciate, is that not everything is approximately linear and additive

  • In cases of non-linear relationships, calculating the linear correlation between two variables will give us the wrong answer

    • If you make stupid assumptions, you will get stupid results

    • Causation need not imply linear correlation

  • Lots of interesting relationships are non-linear and there are ways of analyzing these relationships

    • Pro-tip: The two most powerful tools for data analysis ever created are the histogram and the scatter plot

Non-Linear Relationships

The easiest way to illustrate the problem is usually through simulation

# Simulate some non-linear relationships
nonlinear_sims <- tibble(
  x = runif(n = 2e3, min = -10, max = 10),
  y_posquad = x + x^2 + rnorm(2e3, 0, 2),
  y_negquad = x - x^2 + rnorm(2e3, 0, 2),
  y_sin = sin(x*3.14) + rnorm(2e3, 0, 0.5),
  y_linear = x + rnorm(2e3, 0, 2)
)

Then we can use ggplot2 to make scatter plots of the relationships in the simulated variables

Non-Linear Relationships

# Initiate the plot object
posquad_plot <- ggplot(nonlinear_sims, aes(x = x, y = y_posquad, fill = x)) +
  # Add the data points
  geom_point(shape = 21, size = 3) +
  # Add the linear fit
  geom_smooth(method = "lm", size = 2, se = FALSE, lty = 2, color = "white") +
  # Tweak the fill color scheme
  scale_fill_viridis_c() +
  # Labels for the plot
  labs(
    x = "X",
    y = "Y",
    title = latex2exp::TeX(r'($Y_{i} = X_{i} + X_{i}^{2} + \epsilon_{i}$)')
  ) +
  # Adjust the x axis scales
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  # Adjust the y axis scales
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8))

Non-Linear Relationships

[Plot: posquad_plot, the simulated positive quadratic relationship; the linear fit misses the U-shaped pattern]

Non-Linear Relationships

# Initiate the plot object
negquad_plot <- ggplot(nonlinear_sims, aes(x = x, y = y_negquad, fill = x)) +
  # Add the data points
  geom_point(shape = 21, size = 3) +
  # Add the linear fit
  geom_smooth(method = "lm", size = 2, se = FALSE, lty = 2, color = "white") +
  # Tweak the fill color scheme
  scale_fill_viridis_c() +
  # Labels for the plot
  labs(
    x = "X",
    y = "Y",
    title = latex2exp::TeX(r'($Y_{i} = X_{i} - X_{i}^{2} + \epsilon_{i}$)')
  ) +
  # Adjust the x axis scales
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  # Adjust the y axis scales
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8))

Non-Linear Relationships

[Plot: negquad_plot, the simulated negative quadratic relationship; the linear fit misses the inverted-U pattern]

Non-Linear Relationships

# Initiate the plot object
sin_plot <- ggplot(nonlinear_sims, aes(x = x, y = y_sin, fill = x)) +
  # Add the data points
  geom_point(shape = 21, size = 3) +
  # Add the linear fit
  geom_smooth(method = "lm", size = 2, se = FALSE, lty = 2, color = "white") +
  # Tweak the fill color scheme
  scale_fill_viridis_c() +
  # Labels for the plot
  labs(
    x = "X",
    y = "Y",
    title = latex2exp::TeX(r'($Y_{i} = Sin(X_{i}\cdot \pi) + \epsilon_{i}$)')
  ) +
  # Adjust the x axis scales
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  # Adjust the y axis scales
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8))

Non-Linear Relationships

[Plot: sin_plot, the simulated sinusoidal relationship; the linear fit captures little of the oscillating pattern]

References